
Conversation

LennartPurucker commented Sep 5, 2025

This code adds LimiX (https://arxiv.org/pdf/2509.03505) from https://github.com/limix-ldm/LimiX

Notes

  • I have been using and running the non-retrieval version. In my small experiments, the retrieval version sometimes took more than one hour for a single small dataset and performed worse on most TabArena datasets. I suspect there is a bug in the code or in the way I am using it.
  • I have not compiled the model and am using torch's native flash attention, so there may be room to make the model faster.
  • To run the method in a typical data science pipeline, I had to "single"-thread DDP (see the DDP sketch after this list). Sadly, the original code has no option to simply disable DDP without major changes. This would be a big TODO for proper integration (besides code quality improvements in other areas and a closer alignment with the sklearn API). One also needs to fix some of the code related to shutting down DDP. Just to be clear: while "inference without DDP" is False, the problem is this code here: https://github.com/limix-ldm/LimiX/blob/main/inference/inference_method.py#L18 (and I changed it in the code in this PR).
  • When trying to get the method to run, I noticed that this code https://github.com/limix-ldm/LimiX/blob/main/inference/predictor.py#L321 crashes on datasets with just one feature (in one of the preprocessing configs) because of the .squeeze(). I removed the .squeeze() (see the shape sketch after this list) and am not sure why it was there in the first place, as a randomly appearing dimension sounds more like a bug.
  • I also replaced all the lambda functions with real functions or functools.partial to make LimiX picklable (https://github.com/limix-ldm/LimiX/blob/main/inference/preprocess.py#L435); see the pickling sketch after this list.
  • LimiX does not support installation from GitHub or via pip so far, which is quite unfortunate.
  • The cache path used in the tutorial for downloading the model is not a path one should use on all systems. I added the TabPFNv2 cache-path logic to make this stable (see the cache-path sketch after this list).
  • It is unclear how the default configs were found/optimized, but it would be good to get a search space we can tune over to see how much better we can make LimiX.
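For the DDP point above, here is a minimal sketch of what "single"-threading DDP means: initializing torch.distributed with a world size of 1 so DDP-dependent code can run inside a normal pipeline, plus a clean teardown. This is illustrative only, not the LimiX code; the function names and port are assumptions.

```python
# Illustrative sketch only: run DDP-dependent code as a single process inside a
# normal data science pipeline, without torchrun or multiple workers.
# Function names and the port are assumptions, not LimiX code.
import os

import torch.distributed as dist


def init_single_process_ddp(port: int = 29500) -> None:
    """Initialize torch.distributed with a world size of 1."""
    if dist.is_initialized():
        return
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", str(port))
    # gloo works without GPUs; nccl would also work with a single GPU.
    dist.init_process_group(backend="gloo", rank=0, world_size=1)


def shutdown_single_process_ddp() -> None:
    """Tear down the process group so repeated calls in a pipeline do not leak state."""
    if dist.is_initialized():
        dist.destroy_process_group()
```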
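The .squeeze() issue above boils down to a singleton feature axis being dropped. A minimal, illustrative reproduction (not the LimiX code):

```python
# Illustrative reproduction of the failure mode, not the LimiX code.
import numpy as np

X = np.random.rand(100, 1)  # dataset with a single feature: shape (100, 1)
X_squeezed = X.squeeze()    # shape (100,): the feature axis silently disappears

# Downstream code that assumes a 2D (n_samples, n_features) array,
# e.g. X[:, 0] or X.shape[1], now crashes or misbehaves on 1D input.
assert X.shape == (100, 1)
assert X_squeezed.ndim == 1
```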
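The lambda replacement above matters because Python's standard pickle cannot serialize lambdas, while module-level functions and functools.partial objects are picklable. A minimal, illustrative sketch (not the actual preprocessing transforms):

```python
# Illustrative sketch of the pickling issue and its fix; the real preprocessing
# transforms in inference/preprocess.py are more involved.
import pickle
from functools import partial


def scale(x, factor):
    """Module-level function: picklable, unlike an inline lambda."""
    return x * factor


# transform = lambda x: x * 2.0         # pickle.dumps(transform) raises PicklingError
transform = partial(scale, factor=2.0)  # picklable replacement

restored = pickle.loads(pickle.dumps(transform))
assert restored(3.0) == 6.0
```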
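For the cache-path point above, here is a minimal sketch of the kind of resolution logic meant: an environment-variable override plus a stable per-user cache directory. The variable name LIMIX_MODEL_CACHE_DIR and the default location are illustrative assumptions, not the exact TabPFNv2 or LimiX logic.

```python
# Illustrative sketch; the environment variable name and default location are
# assumptions, not the exact TabPFNv2 or LimiX logic.
import os
from pathlib import Path


def get_model_cache_dir() -> Path:
    """Resolve a stable, per-user cache directory for the model weights."""
    override = os.environ.get("LIMIX_MODEL_CACHE_DIR")  # hypothetical override
    if override:
        cache_dir = Path(override)
    else:
        # XDG-style default; a library such as platformdirs would handle other OSes.
        cache_dir = Path(os.environ.get("XDG_CACHE_HOME", Path.home() / ".cache")) / "limix"
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir
```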

Performance on the TabPFN-subset of TabArena-Full

[image: benchmark results on the TabPFN subset of TabArena-Full]

I have tried running the method on more datasets as well, and it worked. However, for the larger datasets in TabArena (e.g., 50k samples, 130 features), it runs out of VRAM (given 40 GB of VRAM). So for now, I will stick to this subset.


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

LennartPurucker commented:

For the sake of completeness, here are the results with retrieval (see the prefix [RET]) for TabArena-Lite, again on the TabPFN subset.
On this subset of small datasets, retrieval was not able to finish within the time limit of 1.5 hours (including overhead) on 24.24% of the datasets.

[image: benchmark results with retrieval ([RET]) on the TabPFN subset of TabArena-Lite]


LennartPurucker commented Sep 17, 2025

I added batching of the test predictions, which allowed me to run the method on a few more datasets.
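Roughly, the batching looks like the following (an illustrative sketch, not the exact PR code; the predictor interface and batch size are assumptions):

```python
# Illustrative sketch of batched test-time prediction to bound VRAM usage;
# the predictor interface and batch size are assumptions, not the exact PR code.
import numpy as np


def predict_in_batches(predictor, X_test, batch_size: int = 4096) -> np.ndarray:
    """Predict the test set in chunks and concatenate the results."""
    outputs = []
    for start in range(0, len(X_test), batch_size):
        outputs.append(predictor.predict_proba(X_test[start:start + batch_size]))
    return np.concatenate(outputs, axis=0)
```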

Even with this batching, I am not able to run LimiX on an H200 with the current setup for the OpenML task IDs 363628 and 363673, as they have too many samples: 363628 has 90k training samples and 21 features; 363673 has 100k training samples and 10 features.

At the same time, the datasets that I can now run with batching on an H200 (roughly up to 70k training samples) also take a very long time to predict (multiple hours, and even longer due to batching the test predictions).

I will postpone further investigation for larger datasets until an update from the authors and otherwise stick to the TabPFN-subset limits, as these seem roughly in scope in terms of efficiency.


LennartPurucker commented Sep 17, 2025

Here are the results on TabArena-Full for all datasets.

For LimiX, we had to impute two datasets (4%), that is, the tasks that ran out of VRAM as mentioned above.
All other imputed foundation models required many more imputed datasets (see the official leaderboard for the numbers), as they were not run on an H200.
[image: benchmark results on TabArena-Full, all datasets]


LennartPurucker commented Nov 11, 2025

The authors of LimiX published a new, smaller model and updated their code.

Moreover, they benchmark on TabArena datasets in their paper. However, the results in the paper are problematic (holdout validation, weak baselines, ...); basically, most of the errors TabArena aims to avoid. In addition, the regression CD plot looks weird, likely because they don't have enough datasets in their subset. Hence, I think it would be good to get a new snapshot of their method. I contacted the authors to see if they want to help with this.

Besides that, I first need to rebase/merge this PR with all our recent main-line changes, but the model code should stay similar.

limix-ldm commented:

Regarding the points you raised, we believe there are some mismatches here:

(1) Holdout Validation: Our model evaluation strictly follows standard practices in the machine learning community. As described in our technical report, for public benchmarks (e.g., OpenML datasets), we use the official train/test splits provided by the benchmark. For models that require hyperparameter tuning, we perform cross-validation only on the training set to select the optimal hyperparameters. Final evaluation is always conducted on a held-out test set with no leakage. We consider this a fair and widely accepted protocol. For transparency and reproducibility, the complete list of datasets is publicly available on GitHub (https://github.com/limix-ldm/LimiX/tree/main/benchmark_list). That said, we acknowledge that alternative, equally valid methodologies exist in academic research, and we welcome diverse perspectives. We do not view differing approaches as inherently “problematic.”

(2) Baseline Models: In the interest of fair comparison, we follow the implementation and hyperparameter settings provided in TALENT (https://github.com/LAMDA-Tabular/TALENT) for all NN-based models. Tree-based models undergo hyperparameter optimization via Optuna, and AutoGluon employs its built-in hyperparameter search. Importantly, LimiX is evaluated without any hyperparameter tuning. The reported results of LimiX use a single, generic set of hyperparameters that are neither dataset-specific nor benchmark-specific. Thus, we consider these reasonable baselines for comparison.

(3) TabArena Regression Subset: Our inclusion criteria for the TabArena regression subset follow the standards commonly adopted in other benchmarks: ≤ 50,000 training samples and ≤ 10,000 features. Based on these criteria, all 13 regression datasets referenced in the TabArena paper are included in our benchmark (see the GitHub link above).

[image: overview of the 13 TabArena regression datasets included in the benchmark]

Accordingly, the differences observed in the CD diagram likely arise from the fact that some prior works (e.g., PFN) evaluated only a smaller subset of the data (e.g., 7 out of the 13 datasets), whereas our analysis covers the full set of 13 regression datasets. It is also worth noting that our implementation of the CD plot follows the same strategy used in https://github.com/hfawaz/cd-diagram and TALENT (see the GitHub link above). In these implementations, the average-rank comparison is replaced by a Wilcoxon signed-rank test (Wilcoxon, 1945) with Holm’s alpha correction (Holm, 1979; Garcia and Herrera, 2008).
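For concreteness, the pairwise comparison described above (Wilcoxon signed-rank tests with Holm's alpha correction) roughly corresponds to the following sketch, assuming scipy and statsmodels; the per-dataset scores are random placeholders, not benchmark results:

```python
# Illustrative sketch of pairwise Wilcoxon signed-rank tests with Holm's alpha
# correction; the per-dataset scores are random placeholders.
from itertools import combinations

import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
scores = {  # one score per dataset (13 datasets) for each method
    "LimiX": rng.random(13),
    "ModelB": rng.random(13),
    "ModelC": rng.random(13),
}

pairs = list(combinations(scores, 2))
p_values = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]
reject, p_holm, _, _ = multipletests(p_values, alpha=0.05, method="holm")

for (a, b), p, significant in zip(pairs, p_holm, reject):
    print(f"{a} vs {b}: corrected p={p:.3f}, reject H0={significant}")
```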

We greatly appreciate the dialogue and remain open to further discussion and collaboration.


LennartPurucker commented Dec 5, 2025

Heyho @limix-ldm, thank you for the reply!
Great to see we can keep the dialogue open and make sure we can get the best version of LimiX onto TabArena :)

I think we slightly disagree on a few points, so I wanted to share some of my thoughts below:

(1) Inner holdout validation is very well known to bias results. This was a big mistake of many prior benchmarks, which TabArena avoids. Outer holdout validation can be appropriate in some cases, such as TabArena-Lite, but should still be avoided when reporting final results. I'm happy to talk more about this later when needed, but for now, I recommend checking out our NeurIPS paper for the latest state of the literature on this topic.

(2) The problem with TALENT is that the baselines are weak and, in some cases, not implemented in pipelines that achieve peak performance (also check out our paper for many details on this topic). I agree that they are reasonable baselines, but beating them is not enough to claim state-of-the-art. In TabArena, we ensure all baselines are as strong as possible. Thus, it is much harder to claim state-of-the-art, but when a method is state-of-the-art, the result is very representative.

(3) I have no problem with the datasets used in this case. Note that the CD plot would look even weirder with fewer datasets, as in the TabPFN subset -- but it does not for TabArena (check out the generated results, as we also compute CD plots for all subsets). Small note: if you use the implementation from hfawaz/cd-diagram, you are not actually producing a CD plot. A CD plot requires a critical difference by definition, which the Wilcoxon-Holm approach does not provide (see https://www.jmlr.org/papers/volume7/demsar06a/demsar06a.pdf). I recommend using the CD plot from Autorank (https://github.com/sherbold/autorank); see the sketch below.
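A minimal sketch of producing such a CD diagram with Autorank, assuming the autorank package's autorank/plot_stats API; the per-dataset scores below are placeholders, not TabArena results:

```python
# Illustrative sketch of a CD diagram via Autorank; the per-dataset scores are
# random placeholders, not TabArena results.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from autorank import autorank, plot_stats

rng = np.random.default_rng(0)
data = pd.DataFrame(  # rows = datasets, columns = methods, higher = better
    rng.random((13, 3)), columns=["LimiX", "ModelB", "ModelC"]
)

result = autorank(data, alpha=0.05, verbose=False)  # omnibus test + post-hoc analysis
plot_stats(result)  # draws the critical-difference diagram when applicable
plt.show()
```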


The interpretation of your results depends on what you claim based on these points. There is nothing inherently wrong with them, but they might not be entirely accurate, as we know there are better ways to do benchmarking. Sadly, the tabular benchmarking community has been ignoring the better ways for too long. With TabArena, we aim to change this and hence have to call out such behavior as well!

As a closing thought: note that in your work, you have compared on the datasets of TabArena, not on the TabArena benchmark. To guarantee a fair comparison, it is essential to use the same pipeline as all baselines, as we require and recommend with TabArena. Comparing only on the datasets from TabArena, with your own pipeline and a different evaluation protocol, removes all the improvements to benchmarking we introduced with TabArena and makes the results less robust.


limix-ldm commented Dec 5, 2025

Thanks for the reply. This is a valuable starting point for highlighting the limitations of current benchmarks. We will study it and consider incorporating the TabArena pipeline in future work.
